System and Method of Accelerating Document Processing

ABSTRACT

Embodiments include methods and systems for processing XML documents. One embodiment is a system that includes a tokenizer configured to identify tokens in an XML document. A plurality of speculative processing modules are configured to receive the tokens and to at least partially process the XML document and to provide data indicative of the XML document. A first module is configured to perform further processing of the XML document using the data indicative of the XML document and configured to output the processed XML document. Each of the plurality of speculative processing modules is configured to asynchronously provide the data indicative of the XML document to the first module. Other embodiments include methods and systems for performing the speculative processing.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to a system and method for processing structured documents such as extensible markup language (XML) documents.

2. Description of the Related Technology

Extensible markup language (XML) is a data description language that provides a mechanism to represent structured data in a way that retains the logical structure and interrelationship of the underlying data. In XML, data is represented as Unicode text using standardized markup syntax to express the structural information about that data. In brief, XML syntax includes tags (a string bracketed by ‘<’ and ‘>’) and attributes (syntax of the form attribute_name=“value”). The particular tags and attributes used in a document may be selected with reference to the type of data that is represented by a particular document. Moreover, an XML document may be constructed to conform to a document type definition (DTD). A DTD is a formal description of a particular type of document. It sets forth what elements the particular type of document may contain, the structure of the elements, and the interrelationship of the elements.

While XML is human readable, XML documents, particularly those which conform to a well-known or standardized DTD, provide a convenient means of data exchange between computer programs in general, and on the Internet in particular. However, many of XML's features, as well as the use of text and the structures encoded within the text, make XML document processing processor intensive. Thus, in systems that exchange a high volume of XML data, e.g., e-commerce systems that process XML encoded security data, XML processing may tend to consume so much of a server's processing power that the amount of processing power remaining to actually apply the XML data to the relevant application may be impacted. One solution to this problem is to offload processing of XML queries to dedicated content processors that employ hardware specifically configured to process XML. However, the memory and processor requirements associated with XML processing have limited the cost effective implementation of content processing for XML queries. Thus, simpler yet resource efficient systems and methods of processing XML documents are needed.

SUMMARY OF CERTAIN INVENTIVE ASPECTS

The system, method, and devices of the invention each have several aspects, no single one of which is solely responsible for its desirable attributes. Without limiting the scope of this invention as expressed by the claims which follow, its more prominent features will now be discussed briefly. After considering this discussion, and particularly after reading the section entitled “Detailed Description of Certain Embodiments” one will understand how the features of this invention provide advantages that include faster and more efficient processing of XML documents.

One embodiment is a system for processing XML documents. The system includes a tokenizer configured to identify tokens in an XML document. The system further includes a plurality of speculative processing modules configured to receive the tokens and to at least partially process the XML document and to generate data indicative of the XML document. The system further includes a first module configured to receive the data indicative of the XML document and configured to perform further processing of the XML document using the data indicative of the XML document and configured to output the processed XML document. Each of the plurality of speculative processing modules is configured to asynchronously provide the data indicative of the XML document to the first module.

Another embodiment includes a method of processing a document having structured data. The method includes determining a first indicator that identifies structure of a first document, determining at least one property of the first document, and storing the first indicator with the at least one property. In one embodiment, the method further includes determining a second indicator that identifies structure of a second document, matching the second indicator to the first indicator, and retrieving the at least one property stored with the first indicator.

Another embodiment is a content processor containing software defining a process which when executed causes the content processor to perform the acts of: determining a first indicator that identifies structure of a first document having structured data; determining at least one property of the first document; and storing the first indicator with the at least one property.

Another embodiment is a method of searching in structured documents. The method includes storing a plurality of isomorphic digest values in a data structure that associates each of the plurality of isomorphic digest values with data indicative of at least a portion of a respective hierarchical structure. The method further includes identifying a portion of an XML document. Said portion may comprise a hierarchical structure. The method further includes determining an isomorphic digest indicative of the portion. The method further includes identifying the isomorphic digest value with a stored one of the plurality of isomorphic digest values. The method further includes outputting the associated data indicative of the at least a portion of a respective hierarchical structure.

Another embodiment is a method of transforming an XML document into a predetermined format. The method includes identifying at least a portion of the XML document that is noncompliant with the predetermined format. When the portion of the XML document can be transformed into compliance using at least one of a plurality of transformations, the method further includes transforming the portion of the XML document into compliance with the predetermined format using the at least one of the plurality of transformations and outputting the transformed portion of the XML document. When the portion of the XML document cannot be transformed into compliance using at least one of the plurality of transformations, the method further includes outputting data indicative of the portion of the XML document.

Another embodiment is a system for transforming an XML document into a predetermined format. The system includes means for identifying a portion of the XML document that is noncompliant with the predetermined format, means for transforming the portion of the XML document into the predetermined format using at least one of a plurality of transformations and providing the transformed portion when the XML document can be transformed into compliance using the at least one of the plurality of transformations, and means for outputting data indicative of the portion of the XML document when the XML document cannot be transformed into compliance using at least one of the plurality of transformations.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an exemplary system for processing XML documents.

FIG. 2 is a flowchart illustrating an exemplary method of using a system such as illustrated in FIG. 1 to identify and cache documents having substantially similar structures.

FIG. 3 is a flowchart illustrating an exemplary method of using a system such as illustrated in FIG. 1 to define points of interest in XML documents.

FIG. 4 is a flowchart illustrating an exemplary method of identifying points of interest, such as defined by the method of FIG. 3, in an XML document.

FIG. 5 is a flowchart illustrating an exemplary method of canonicalization processing of an XML document in a system such as illustrated in FIG. 1.

FIG. 6 is a block diagram illustrating in more detail one embodiment of the system of FIG. 1.

DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS

The following detailed description is directed to certain specific embodiments of the invention. However, the invention can be embodied in a multitude of different ways as defined and covered by the claims. In this description, reference is made to the drawings wherein like parts are designated with like numerals throughout.

Processing of XML documents is typically performed as a set of sequential steps. The steps used in that sequence to process a particular document may vary with the content of the XML document. For example, after receiving an XML document, a system may first identify the type of document. For a particular type of document, data in a certain element of the XML document may be desired. Thus, the document may be more fully processed to, for example, resolve namespaces in the document to properly identify the desired data element.

One way of improving processing performance is by using customized hardware to perform part of the XML processing. For example, overall throughput of an XML document processing system, in general, and applicability of hardware acceleration for such processing, in particular, can be improved by speculatively performing certain functions of processing XML documents in parallel upon receipt of such documents for processing. As such documents are further processed, the data derived by the speculatively processed functions is available without delay for continued processing by other XML processing functions.

Moreover, XML documents in many applications are computer generated. Computer generated XML documents tend to have a very consistent structure from document to document, even though the actual data content itself varies. For example, XML documents that describe an “employee,” e.g., produced to comply with a DTD that includes an “employee,” may be structurally identical, e.g., providing the same elements of each employee with the same structure and with the same order within the structure, with only the actual attribute values of each document varying. More generally, most XML documents received by a system for processing are in fact well formed, valid, designed against an available schema, in canonical XML form, and include namespaces that are known in advance. Thus, the overall processing of XML documents can be simplified and thus accelerated by assuming that documents are in such regular forms until data in the document shows otherwise. In particular, such documents can be processed more efficiently by identifying documents that share the structure of a previously processed document so that data structures or other data associated with the prior processing of structural data can be reused.

FIG. 1 is a block diagram illustrating an exemplary system 100 for processing XML documents. In one embodiment, the system 100 includes one or more processors in communication with memory and storage. The system 100 includes speculative processing modules 102 that process functions of an XML document received by the system 100 and provide data regarding those functions to an XML processing module 104. In one embodiment, one or more of the speculative processing modules 102 execute in parallel. In one embodiment, the system 100 includes one or more general purpose processors and one or more content processors configured to execute software instructions for performing at least some of the functions attributed to the system 100. In one embodiment, the content processors may include general purpose microprocessors or other digital logic circuits, e.g., a programmable gate array (PGA) or an application specific integrated circuit (ASIC) configured to perform at least some of the functions attributed to the system 100. In one embodiment, the speculative processing modules 102 are executed in parallel on one or more of the general purpose processors and/or the content processors. The speculative processing modules 102 perform particular functions in the processing of XML documents that may or may not be needed for any given document and provide that data asynchronously, e.g., without the data being requested, to the XML processing module 104 for further processing. However, by performing portions of the processing in parallel and asynchronously, overall processing time, e.g., system latency, of XML documents can be decreased as compared to serial processing where particular functions of processing are performed only as needed. Moreover, by selecting suitable functions to perform speculatively, such functions can be implemented in hardware to further improve performance.

In one embodiment, the speculative processing modules 102 include one or more of a tokenizer 110 that processes XML documents into syntactic tokens, a well-formed document check module 111 that at least partially determines whether the XML document is well formed according to the XML specification, and a validator 112 that at least partially determines whether the XML document is valid. In one embodiment, the tokenizer 110 may include hardware for performing tokenization such as described in U.S. patent application Ser. No. 10/831,956 entitled “SYSTEM AND METHOD OF TOKENIZING DOCUMENTS,” filed Apr. 26, 2004, and incorporated by reference in its entirety. In addition, U.S. patent application Ser. No. 10/774,663, filed Jul. 2, 2004, which is hereby incorporated by reference in its entirety, describes one embodiment of the validator 112 that identifies documents that are not well-formed by using statistics obtained by, for example, the tokenizer 110.

In one embodiment, the speculative processing modules 102 include a digest caching module 114 that maintains a cache of data associated with XML documents having particular structures. The speculative processing modules 102 may also include a conformance engine 116 that determines whether a document conforms to a particular structure, and identifies data in the document that does so conform.

The speculative processing modules 102 may also include a type inference module 120 that infers the types of attributes of processed XML documents. In one embodiment, the speculative processing modules 102 include a canonicalization (C14N) module 118 that determines whether an XML document is in a particular “canonical” form, such as the canonical XML form defined by the World Wide Web Consortium (W3C). The speculative processing modules 102 may also include a namespace resolution module 124 that resolves names within the XML document structure according to the proper namespace.

In operation, an XML document is received by the system 100 for processing. Each of the speculative processing modules 102 processes the incoming XML document. In one embodiment, each of the speculative processing modules 102 processes the XML document concurrently or in parallel using multiple processors and/or digital logic circuits. The results of the speculative processing are provided to the XML processing module 104 for further processing. As the XML processing module 104 proceeds with processing, processed data provided by the speculative processing modules 102 is available to the XML processing module 104 without further processing delay. For example, the XML processing module 104 can obtain data already located in the document by the conformance engine 116.
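
The asynchronous hand-off described above can be sketched in software as follows. This is only a minimal illustration, not the hardware arrangement of the system 100: the function names (tag_count, looks_well_formed, process_document) and the two toy checks are hypothetical stand-ins for the speculative processing modules 102 and the XML processing module 104.

```python
from concurrent.futures import ThreadPoolExecutor


def tag_count(doc):
    # Hypothetical stand-in for a tokenizer-like speculative module.
    return doc.count("<")


def looks_well_formed(doc):
    # Toy check only; a real well-formedness check is far more involved.
    return doc.count("<") == doc.count(">")


def process_document(doc, modules):
    # Start every speculative module at once, before any result is requested.
    with ThreadPoolExecutor(max_workers=len(modules)) as pool:
        results = {name: pool.submit(fn, doc) for name, fn in modules.items()}
        # Downstream processing picks up whichever results it turns out to need.
        if not results["well_formed"].result():
            return None
        return {"tag_count": results["tag_count"].result()}


modules = {"tag_count": tag_count, "well_formed": looks_well_formed}
summary = process_document("<doc><e1></e1></doc>", modules)
```

In this sketch every module runs whether or not its result is eventually used, which trades some wasted work for lower latency when the result is needed.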

In one embodiment, the digest caching module 114 is configured to calculate a hash or digest value that identifies the structure of an XML document. The digest value is then stored in a data cache so as to identify cached data associated with the structure or other properties of the document. In one embodiment, the cache is a hashtable that identifies the cached data with the respective digest value. The cached document properties may include data or data structures derived by processing the XML document. When a second XML document is received having the same structure, and thus the same digest value as the first XML document in the cache, the cached data may be used to process the second document without requiring additional processing time to derive the cached data.

FIG. 2 is a flowchart illustrating an exemplary method 150 of using the system 100 to identify and cache documents having substantially similar structures. In one embodiment, data obtained during processing of a first document is cached for later use while processing later received documents that have substantially the same XML structure. The method 150 begins at a block 152 in which the system 100 receives an XML document. Next at a block 154, the system 100 processes the XML document, e.g., performs initial tokenization and parsing, to determine the structure of the document. Moving to a block 156, the digest caching module 114 determines an isomorphic digest value indicative of the structure of the XML document. Two XML documents may be referred to as being isomorphic if the documents have the same structure. A digest value refers to a representation of the document in the form of a short string or numeric value such as calculated by a one-way or hash function. Generally, the effectiveness of a hash table used to cache data may be limited by the effectiveness of the hash function for uniquely identifying the data being hashed. In one embodiment, a cryptographic hash function, e.g., SHA-1, is used to calculate the digest value for each document. By using such a hash function, there is no reasonable likelihood that two different structures hash to the same value.

An isomorphic digest value is thus a digest value that when calculated for two documents is the same if those two documents have the same structure and different if the two documents have different structure. In one embodiment, the digest value for each document is calculated using a string that represents the structure of the document. In one embodiment, the string is determined from the document by removing all whitespace, comments, attribute values, and node text values. Additional symbols may be placed within the string to disambiguate elements from attributes. Namespace declarations may include the Uniform Resource Identifier (URI) of the namespace. See Table 1 for an example document and structure string.

TABLE 1

Document:
<doc>
  <e1>
    <e5></e5>
  </e1>
  <e2></e2>
  <e3 attr1=“attr1value”></e3>
  <e4 attr1=“attr1value”></e4>
</doc>

Structure:
<doc><e1><e5></e1><e2><e3#attr1><e4#attr2></doc>
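
One way to derive a structure string and its digest in software is sketched below. The helper names (structure_string, isomorphic_digest) are hypothetical, and several details are assumptions drawn from the pattern of Table 1 rather than stated rules: attributes are appended to the element name as “#name” in document order, and a closing tag is emitted only for elements that have child elements.

```python
import hashlib
import xml.etree.ElementTree as ET


def structure_string(elem):
    # Keep element and attribute names; drop attribute values and text.
    attrs = "".join(f"#{name}" for name in elem.attrib)
    out = f"<{elem.tag}{attrs}>"
    children = list(elem)
    if children:
        out += "".join(structure_string(child) for child in children)
        out += f"</{elem.tag}>"
    return out


def isomorphic_digest(xml_text):
    # SHA-1 over the structure string, as in the embodiment described above.
    root = ET.fromstring(xml_text)
    return hashlib.sha1(structure_string(root).encode("utf-8")).hexdigest()


doc_a = '<emp id="1"><name>Ann</name><dept/></emp>'
doc_b = '<emp id="2">\n  <name>Bob</name>\n  <dept></dept>\n</emp>'
doc_c = '<emp id="3"><dept/><name>Cy</name></emp>'
```

Here doc_a and doc_b differ only in values and whitespace and therefore share a digest, while doc_c reorders its child elements and receives a different one.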

Proceeding to a block 160, the digest caching module 114 looks up the digest value of the current document in the cache to determine whether its digest value is stored in the cache. In one embodiment, the cache includes a hashtable that uses digest values as hash keys for identifying cached data. If cached data is identified with the digest value of the current document, the method 150 proceeds to a block 162 in which the cached data, or a pointer to the cached data, is retrieved and provided to other components of the system 100 for use in processing the current XML document. Next at a block 164, the system 100 processes the current XML document, using the cached data. Referring again to the block 160, if no cached data is identified with the digest value, the method 150 proceeds to the block 164, in which the current XML document is processed without benefit of any cached data. Next at a block 166, the digest caching module 114 stores suitable data obtained in processing the document at the block 164 in the cache and identifies that data with the digest value. In one embodiment, the hash table stores a pointer to the data at the entry in the hash table for the digest value in order to identify the data with the digest value.
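
The lookup-then-store flow of blocks 160 through 166 can be sketched with an ordinary dictionary serving as the hash table. The class name DigestCache, its process method, and the derive callback are hypothetical names introduced for illustration; the structure string is assumed to have been computed already.

```python
import hashlib


class DigestCache:
    def __init__(self):
        self._entries = {}  # digest value -> data derived from the structure

    def process(self, structure, derive):
        digest = hashlib.sha1(structure.encode("utf-8")).hexdigest()
        if digest in self._entries:        # block 160: digest found in cache
            return self._entries[digest], True   # block 162: reuse cached data
        data = derive(structure)           # block 164: process without cache
        self._entries[digest] = data       # block 166: store under the digest
        return data, False


calls = []


def derive(structure):
    # Stand-in for the costly structural processing of a document.
    calls.append(structure)
    return {"tag_count": structure.count("<")}


cache = DigestCache()
first = cache.process("<doc><e1></doc>", derive)
second = cache.process("<doc><e1></doc>", derive)
```

The second call returns the stored data and reports a cache hit, so derive runs only once for the two structurally identical inputs.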

FIG. 3 is a flowchart illustrating an exemplary method 170 of using the system 100 to define points of interest in XML documents. In particular, in one embodiment, the conformance engine 116 is configured to identify points of interest in the structure of an XML document, e.g., to determine whether the document conforms to a particular form. The conformance engine 116 thus searches in parallel for all identified points of interest that may be in the document as the document is processed. In one embodiment, the conformance engine 116 maintains a data structure, embodied as a conformance table, for identifying particular XML document structures that have been defined by, e.g., a user of the system 100, as a point of interest. In one embodiment, the conformance table is a hash table. In one embodiment, the structure of each point of interest is stored in the hash table according to its digest value. In addition, in order to facilitate processing of the structure of the document in a top down fashion as each level of structure is encountered, each structure above the point of interest in the hierarchy of the document's structure is also stored in the conformance table. For example, returning to the example in Table 1, the structure of a particular example document can be represented as a string:

“<doc><e1><e5></e1><e2><e3#attr1><e4#attr2></doc>.”

If the “e5” element represents a point of interest, then hash values for each of the structures that are above “e5” in the hierarchical document structure, i.e., its “parent” structures, of “<doc>,” and “<doc><e1>” along with the structure of the point of interest, “<doc><e1><e5>” are stored in the table along with corresponding data. In one embodiment, the corresponding data for parent structures “<doc>” and “<doc><e1>” may include a “further processing” token. The corresponding data for the actual point of interest structure, “<doc><e1><e5>,” may include a particular token identifying the point of interest, e.g., a numeric code that, for example, indexes a table containing other data relating to a defined group of points of interest. In one embodiment, the corresponding data includes a pointer to a data structure containing further information about the point of interest. In another embodiment, the corresponding data may include a pointer to instructions that are executed upon identifying the point of interest.

The method 170 of generating the entries for a particular point of interest in the conformance table begins at a block 172 in which the conformance engine 116 receives a nested structure identifying a point of interest. For convenience of description, the levels of nested structure may be referred to by a number from 1 to N, with the level 1 being associated with the top most level of the structure and level N being associated with the most deeply nested level of structure. For example, using the example document of Table 1 and the example point of interest at “<doc><e1><e5>”, the “<doc>” level may be identified as level 1, the “<e1>” entity as level 2 and the “<e5>” entity as level 3, with N=3.

Next at a block 174, the conformance engine 116 determines a digest value for each level of the structure. In one embodiment, the conformance engine 116 calculates a hash value for the string indicative of each level of structure. For example, a hash value may be calculated for “<doc><e1><e5>” at level N=3, the “<doc><e1>” at level 2 and “<doc>” at level 1. In one embodiment, the hash value is calculated using a cryptographic hash function such as SHA-1.

Proceeding to a block 176, the conformance engine 116 identifies the digest values for each of the structure levels, except for the full structure, N, with a “further processing” indicator or token. In one embodiment, the conformance engine 116 identifies the digest values with the structure levels by storing the further processing token in a hash table using the digest value as the hash key. For example, continuing with the example of Table 1, the hash table may include entries associating the structure strings “<doc>” and “<doc><e1>” with the “further processing” token.

Next at a block 178, the conformance engine 116 identifies the digest value of the full structure of the point of interest, e.g., structure level N, with data for processing the point of interest. In one embodiment, the conformance engine 116 stores the data in a hash table using the digest value of the structure string as the hash key. In one embodiment, the data for processing the point of interest includes data identifying the point of interest, such as a token or numeric reference indicator associated with the point of interest. In another embodiment, the conformance engine stores in the conformance table one or more pointers to data associated with the point of interest. In one embodiment, the pointers may include a pointer to executable instructions for processing the point of interest. For example, the conformance engine 116 calculates the hash key for the string “<doc><e1><e5>” and stores a pointer in the hash table to data obtained in the processing of the exemplary document.
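
Blocks 172 through 178 can be sketched as a routine that registers one point of interest in a dictionary acting as the conformance table. The names FURTHER_PROCESSING, digest, and add_point_of_interest are hypothetical, as is the choice to leave an existing entry untouched when a parent structure is already present in the table.

```python
import hashlib

FURTHER_PROCESSING = "further processing"


def digest(structure):
    return hashlib.sha1(structure.encode("utf-8")).hexdigest()


def add_point_of_interest(table, levels, poi_data):
    # levels lists the nested structure from level 1 to level N (block 172).
    prefix = ""
    for tag in levels[:-1]:
        prefix += tag
        # Blocks 174/176: each parent level maps to the further-processing token.
        table.setdefault(digest(prefix), FURTHER_PROCESSING)
    # Block 178: the full structure (level N) maps to the point-of-interest data.
    table[digest(prefix + levels[-1])] = poi_data


table = {}
add_point_of_interest(table, ["<doc>", "<e1>", "<e5>"], 42)
```

After the call, the table holds three entries: two parent entries carrying the “further processing” token and one entry carrying the illustrative numeric code 42 for “<doc><e1><e5>”.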

FIG. 4 is a flowchart illustrating an exemplary method 180 of identifying points of interest, such as defined by the method 170, in an XML document that is being processed. Beginning at a block 182, the conformance engine 116 receives a portion of the document structure, e.g., the top level structure in an XML document. The conformance engine 116 determines an isomorphic digest value indicative of the portion of the document structure. In one embodiment, the digest value is calculated using a hash function, e.g., a cryptographic hash such as SHA-1. For example, after receiving the “<doc>” top-level element of a document, the conformance engine 116 calculates a hash value for the structure string “<doc>” associated with this substructure. Next at block 184, the conformance engine 116 determines whether the hash key for the current portion of the structure has been stored. Continuing with the example from Table 1, as discussed with reference to FIG. 3, the conformance engine 116 retrieves a “further processing” token from the hash table for the hash key of the string “<doc>”. If the conformance engine 116 identifies any data, e.g., finds the data in the hash table, with the digest value of the current portion of the structure, the method 180 proceeds to a block 186. If, in the block 184, the conformance engine 116 fails to identify any data, e.g., the digest value is not in the hash table, the method 180 proceeds to an end state and the conformance engine finishes processing of the portion of the structure.

Moving to the block 186, the conformance engine 116 determines whether the data identified with the digest value denotes further processing, for example, whether the data includes the “further processing” token. If further processing is indicated, the method 180 proceeds to a block 188. If no further processing is indicated, the method 180 proceeds to a block 190. Returning to the block 188, the conformance engine 116 receives data regarding further structure of the document, e.g., the next deeper level of nested structure (“<doc><e1>” in the example of Table 1). In one embodiment, the conformance engine 116 receives one or more tokens until the next level of structure of the current XML document is defined. Next, the method 180 proceeds back to the block 182 to process this next level of structure.

Returning to the block 190, as no further processing is indicated, the conformance engine 116 returns data indicative of the located point of interest in the document. In one embodiment, the data includes a token associated with the point of interest in the document. The conformance engine 116 outputs the token indicating the presence of the particular point of interest to the XML processing module 104. In another embodiment, the data includes a pointer to executable instructions for processing the point of interest. In one embodiment, the conformance engine 116 executes these instructions. In another embodiment, the conformance engine 116 provides the pointer to the executable instruction to the XML processing module 104 for execution.
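
The top-down search of the method 180 can be sketched as a loop over successively deeper structure prefixes. The function name find_point_of_interest, the dictionary used as the conformance table, and the numeric code 42 are hypothetical; the table is assumed to have been populated beforehand as described with reference to FIG. 3.

```python
import hashlib

FURTHER_PROCESSING = "further processing"


def digest(structure):
    return hashlib.sha1(structure.encode("utf-8")).hexdigest()


def find_point_of_interest(table, levels):
    prefix = ""
    for tag in levels:
        prefix += tag                        # block 188: next deeper level
        entry = table.get(digest(prefix))    # blocks 182/184: digest and lookup
        if entry is None:
            return None                      # end state: nothing of interest here
        if entry != FURTHER_PROCESSING:      # block 186
            return entry                     # block 190: point of interest found
    return None                              # structure ended at a parent level


table = {
    digest("<doc>"): FURTHER_PROCESSING,
    digest("<doc><e1>"): FURTHER_PROCESSING,
    digest("<doc><e1><e5>"): 42,
}
hit = find_point_of_interest(table, ["<doc>", "<e1>", "<e5>"])
miss = find_point_of_interest(table, ["<doc>", "<e2>"])
```

Because unknown prefixes end the search immediately, a document that diverges from every registered structure costs only one hash and one lookup per level examined.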

Generally, it is possible for XML documents which are equivalent for the purposes of many applications to differ in physical representation. For example, such XML documents may differ in their entity structure, attribute ordering, and character encoding. W3C canonicalization (C14N) of XML includes generating a form for an XML document in which certain differences in representation have been removed, for example, converting empty elements to start-end tag pairs, normalizing whitespace outside of document elements and within start and end tags, setting attribute value delimiters to quotation marks (double quotes), adding default attributes to each element, and imposing lexicographic order on the namespace declarations and attributes of each element within the output XML document. The content of two or more documents in canonical XML form can be more easily compared to each other than can documents not in canonical form in order to determine when such documents include equivalent content. Because such documents do not include all possible syntactic constructions in XML, such documents can also be easier to process. One canonical XML format is defined by the World Wide Web Consortium, e.g., “Canonical XML, Version 1.0” (Mar. 15, 2001) available at www.w3c.org. Methods for fully performing C14N processing may be performed in hardware using an ASIC or a gate array. Such a hardware solution may not be cost effective due to the complexity of full C14N processing. However, by performing only selected portions of the C14N processing in hardware logic, C14N processing can be substantially accelerated.

FIG. 5 is a flowchart illustrating an exemplary method 200 of canonicalization processing of an XML document by, for example, the canonicalization module (C14N module) 118. In operation, the C14N module 118 receives each token in the XML document from the tokenizer 110 and processes the token according to the method 200. In one embodiment, the C14N module 118 includes a processor programmed to perform at least a portion of the method 200. In one embodiment, the C14N module includes hardware logic configured to perform at least a portion of the method 200.

The method 200 begins at a block 202 in which the C14N module 118 identifies, based on one or more recent tokens, that the document is not in C14N form. Moving to a block 204, the C14N module 118 determines whether the non-C14N form can be corrected by one of a set of simple transformations. One example of such a simple transformation is the removal of extra whitespace in the document. In one embodiment, the simple transformations include transformations of the document that can be performed efficiently by hardware logic such as in a PGA. If the transformation is a simple transformation, the method 200 proceeds to block 206 in which the C14N module 118 performs the simple transformation. Such simple transformations may include discarding extra whitespace (white space in excess of that defined for canonical form) in the document or encoding special characters. If the transformation is not a simple transformation, the method 200 proceeds to a block 208. At the block 208, the C14N module outputs a token noting the presence, location, and other details of the non-canonical form. For example, transformations such as changing the order of elements to be in canonical lexicographic ordering may be performed in software after the entire document is processed. In one embodiment, the C14N module 118 includes software for performing such non-simple transformations. In one embodiment, the C14N module 118 performs only the simple transformations, in hardware logic, during the initial speculative processing of the document. In one such embodiment, the C14N module 118 processes the outputted tokens to more fully place the document in C14N form after the speculative processing phase.
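
The split between simple and deferred transformations in the method 200 can be sketched as a single pass over a token stream. The token kinds used here (“tag_whitespace”, “attr_names”), the function name c14n_pass, and the shape of the deferred notes are all hypothetical; sorting attribute names with sorted() is also a simplification, since Canonical XML orders attributes by namespace URI and then local name.

```python
def c14n_pass(tokens):
    output, deferred = [], []
    for index, (kind, value) in enumerate(tokens):
        if kind == "tag_whitespace" and value != " ":
            # Block 206: simple transformation, fixed immediately in the stream.
            output.append((kind, " "))
        elif kind == "attr_names" and list(value) != sorted(value):
            # Block 208: not simple; pass through and note it for later software.
            output.append((kind, value))
            deferred.append({"index": index, "issue": "attribute order"})
        else:
            output.append((kind, value))
    return output, deferred


tokens = [
    ("name", "doc"),
    ("tag_whitespace", "   "),
    ("attr_names", ("b", "a")),
    ("text", "hi"),
]
output, deferred = c14n_pass(tokens)
```

The pass never needs to look back at earlier tokens, which is the property that makes the simple cases suitable for streaming hardware logic, while the deferred notes carry enough location data for a later software step.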

The system 100 may speculatively perform any number of XML processing functions depending on the amount of system resources that are available and the particular application. One such function may include making type inferences using the type inference module 120. In one embodiment, the type inference module 120 evaluates each attribute of each element in an XML document and returns a list of possible types for each token. For example, for an attribute value containing “1” the possible types may include boolean (with a value corresponding to “true”) and integer (with a value corresponding to the number 1). In one embodiment, this processing is performed speculatively, e.g., before the need for such data is identified for a particular document, and in hardware logic. Later processing of the XML document can thus be accelerated by having attribute data at least partially processed. For example, because the data of each attribute is processed for each possible type, another component of the system 100 can quickly and efficiently bind the values to a strongly typed (e.g., C or C++) data structure.
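
A software sketch of such type inference is given below. The function name infer_types and the set of candidate types are hypothetical; the boolean literals follow the XML Schema lexical forms (“true”, “false”, “1”, “0”), and the always-present “string” candidate is an added assumption.

```python
import re


def infer_types(value):
    # Every attribute value can at least be bound as a string.
    candidates = {"string": value}
    if value in ("true", "false", "1", "0"):
        candidates["boolean"] = value in ("true", "1")
    if re.fullmatch(r"[+-]?[0-9]+", value):
        candidates["integer"] = int(value)
    return candidates


one = infer_types("1")
word = infer_types("true")
text = infer_types("abc")
```

For the value “1” the result carries both a boolean candidate (True) and an integer candidate (1), so a later binding step can select the pre-converted value that matches the target field type.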

Another such function that may be speculatively processed is XML namespace processing. In one embodiment, the namespace resolution module 124 performs namespace resolution of names in the XML document speculatively, e.g., before the need for such data is identified for a particular document, and in hardware logic, so as to reduce the overall time to process each document.
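
Namespace resolution of the kind described above may be sketched in software as follows. This is a minimal sketch only, assuming the usual scoping rules of Namespaces in XML. The class name `NamespaceResolver` and its methods are hypothetical and do not describe the hardware logic of the namespace resolution module 124.

```python
from typing import Dict, List, Optional, Tuple


class NamespaceResolver:
    """Tracks in-scope xmlns declarations as elements open and close."""

    def __init__(self) -> None:
        # The "xml" prefix is bound by definition in every document.
        self._scopes: List[Dict[str, str]] = [
            {"xml": "http://www.w3.org/XML/1998/namespace"}
        ]

    def push(self, attributes: Dict[str, str]) -> None:
        """Call on each start tag with that tag's attributes."""
        scope: Dict[str, str] = {}
        for name, value in attributes.items():
            if name == "xmlns":
                scope[""] = value                    # default namespace
            elif name.startswith("xmlns:"):
                scope[name[len("xmlns:"):]] = value  # prefixed namespace
        self._scopes.append(scope)

    def pop(self) -> None:
        """Call on each end tag."""
        self._scopes.pop()

    def resolve(self, qname: str, is_attribute: bool = False) -> Tuple[Optional[str], str]:
        """Map a qualified name to a (namespace URI, local name) pair."""
        prefix, sep, local = qname.partition(":")
        if not sep:
            prefix, local = "", qname
            if is_attribute:
                return None, local  # unprefixed attributes have no namespace
        # Innermost declaration wins, so search scopes from the top down.
        for scope in reversed(self._scopes):
            if prefix in scope:
                return (scope[prefix] or None), local
        if prefix:
            raise KeyError(f"unbound prefix: {prefix}")
        return None, local
```

If `push` and `resolve` are called as each start tag is tokenized, the resolved pairs are already available when a later processing stage needs them.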

FIG. 6 is a block diagram illustrating one embodiment of the system 100. The exemplary system 100 includes a processor 302 operably connected to a network interface 304 and a memory 306. In the exemplary system 100, the processor 302 is also connected to a content processor 310. In one embodiment, the content processor 310 includes a processor 312. In one embodiment, the content processor 310 includes a logic circuit 314. The content processor 310 may also include a memory 316. In one embodiment, the content processor 310 is embodied as an ASIC or a PGA.

In view of the above, one will appreciate that embodiments of the invention overcome many of the longstanding problems in the art by simplifying XML processing so that such processing can be further accelerated using hardware. In particular, numerous XML processing functions can be performed in parallel in hardware in advance so that data from these functions is nearly instantly available to later processing functions. Overall XML processing system throughput and latency can thus be improved.

It is to be recognized that each of the modules described above may include various sub-routines, procedures, definitional statements and macros. Each of the modules may be separately compiled and linked into a single executable program. The description of each of the modules is used for convenience to describe the functionality of one embodiment of a system. Thus, the processes that are performed by each of the modules may be redistributed to one of the other modules, combined together in a single module, or made available in, for example, a shareable dynamic link library. In some embodiments, the modules may be executed concurrently or in parallel as distinct threads or processes. The modules may be produced using any computer language or environment, including general-purpose languages such as C, Java, C++, or FORTRAN. Moreover, the functions described with respect to the modules may be performed, in all or in part, by either the general purpose processor 302 or the content processor 310 of FIG. 6.

While the above detailed description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the spirit of the invention. As will be recognized, the present invention may be embodied within a form that does not provide all of the features and benefits set forth herein, as some features may be used or practiced separately from others. The scope of the invention is indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

1-18. (canceled)
19. A method of searching in structured documents, the method comprising: storing a plurality of isomorphic digest values in a data structure that associates each of the plurality of isomorphic digest values with data indicative of a respective portion of a hierarchical structure; determining an isomorphic digest indicative of an XML document portion with an associated hierarchical structure; matching said isomorphic digest value of the XML document portion with a stored one of the plurality of isomorphic digest values; and providing the data indicative of the respective portion of the hierarchical structure associated with the matched and stored isomorphic digest value.
20. The method of claim 19, wherein determining said isomorphic digest comprises calculating a hash of a string indicative of said hierarchical structure.
21. The method of claim 19, wherein the step of storing the plurality of isomorphic digest values in the data structure comprises storing at least one isomorphic digest value in the data structure for each level of the hierarchical structure.
22. The method of claim 19, further comprising processing the XML document using the provided data.
23. The method of claim 19, wherein the data indicative of the respective portion of the hierarchical structure comprises a token indicative of the hierarchical structure.
24. The method of claim 23, further comprising, based on the token indicative of the hierarchical structure, further processing of the XML document.
25-33. (canceled)
34. A machine-readable storage medium, having encoded thereon program code, wherein, when the program code is executed by a machine, the machine implements a method for searching in structured documents, comprising the steps of: storing a plurality of isomorphic digest values in a data structure that associates each of the plurality of isomorphic digest values with data indicative of a respective portion of a hierarchical structure; determining an isomorphic digest indicative of an XML document portion with an associated hierarchical structure; matching said isomorphic digest value of the XML document portion with a stored one of the plurality of isomorphic digest values; and providing the data indicative of the respective portion of the hierarchical structure associated with the matched and stored isomorphic digest value.
35. The storage medium of claim 34, wherein determining said isomorphic digest comprises calculating a hash of a string indicative of said hierarchical structure.
36. The storage medium of claim 34, wherein the step of storing the plurality of isomorphic digest values in the data structure comprises storing at least one isomorphic digest value in the data structure for each level of the hierarchical structure.
37. The storage medium of claim 34, further comprising processing the XML document using the provided data.
38. The storage medium of claim 34, wherein the data indicative of the respective portion of the hierarchical structure comprises a token indicative of the hierarchical structure.
39. The storage medium of claim 38, further comprising, based on the token indicative of the hierarchical structure, further processing of the XML document.
40. Apparatus for searching in structured documents, comprising: a database having a plurality of isomorphic digest values in a data structure, each of the plurality of isomorphic digest values associated with data indicative of a respective portion of a hierarchical structure; a processor configured to i) determine an isomorphic digest indicative of an XML document portion with an associated hierarchical structure; ii) match said isomorphic digest value of the XML document portion with a stored one of the plurality of isomorphic digest values; and iii) provide, in concert with the database, the data indicative of the respective portion of the hierarchical structure associated with the matched and stored isomorphic digest value.
41. The apparatus of claim 40, wherein said isomorphic digest comprises a hash of a string indicative of said hierarchical structure.
42. The apparatus of claim 40, wherein the step of storing the plurality of isomorphic digest values in the data structure comprises storing at least one isomorphic digest value in the data structure for each level of the hierarchical structure.
43. The apparatus of claim 40, further comprising processing the XML document using the provided data.
44. The apparatus of claim 40, wherein the data indicative of the respective portion of the hierarchical structure comprises a token indicative of the hierarchical structure.
45. The apparatus of claim 44, further comprising, based on the token indicative of the hierarchical structure, further processing of the XML document.